fix(perf): keep gpt-oss decode in bf16 by inureyes · Pull Request #17 · lablup/mlxcel

inureyes · 2026-05-18T10:13:37Z

Summary

GptOss single-token decode was promoting activations to FP32 inside the expert MLP and router, breaking the BF16 fast path and causing a 5–6× throughput regression versus mlx-lm.

SwiGLU: mirrors the mlx-lm activation path with a compiled activation-only helper and casts the result back to the input dtype, preventing FP32 promotion through the expert down projection on single-token decode.
MoE router: now uses precise softmax and casts expert scores/results back to the expert/input dtype so residual state remains BF16 across all layers.
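
The two changes above can be illustrated with a toy model of dtype promotion. This is a minimal sketch and not the mlxcel or MLX API: `DType`, `Tensor`, `astype`, `promote`, `mul`, `swiglu`, and `precise_softmax` are all hypothetical names, bf16 is emulated by rounding f32 values, and the activation is the textbook SwiGLU form, which may differ from the actual gpt-oss activation. It only shows why one promoted intermediate makes downstream ops run in FP32, and how casting back to the input dtype stops that.

```rust
// Toy model of dtype promotion. All names are hypothetical illustrations.
#[derive(Clone, Copy, Debug, PartialEq)]
enum DType { BF16, F32 }

#[derive(Clone, Debug)]
struct Tensor { data: Vec<f32>, dtype: DType }

/// Round an f32 to the nearest bf16-representable value (NaN is not handled).
fn round_bf16(x: f32) -> f32 {
    let bits = x.to_bits();
    let bias = 0x7FFF + ((bits >> 16) & 1); // round to nearest, ties to even
    f32::from_bits(bits.wrapping_add(bias) & 0xFFFF_0000)
}

/// Cast a tensor to `dtype`, rounding the values when the target is BF16.
fn astype(t: &Tensor, dtype: DType) -> Tensor {
    let data: Vec<f32> = match dtype {
        DType::BF16 => t.data.iter().map(|&v| round_bf16(v)).collect(),
        DType::F32 => t.data.clone(),
    };
    Tensor { data, dtype }
}

/// Binary-op promotion rule: any F32 operand makes the result F32.
fn promote(a: DType, b: DType) -> DType {
    if a == DType::F32 || b == DType::F32 { DType::F32 } else { DType::BF16 }
}

/// Elementwise multiply whose result dtype follows the promotion rule.
fn mul(a: &Tensor, b: &Tensor) -> Tensor {
    let data: Vec<f32> = a.data.iter().zip(&b.data).map(|(x, y)| x * y).collect();
    astype(&Tensor { data, dtype: DType::F32 }, promote(a.dtype, b.dtype))
}

/// Textbook SwiGLU, silu(gate) * up, evaluated in f32.
/// With `keep_dtype` the result is cast back to the gate's dtype.
fn swiglu(gate: &Tensor, up: &Tensor, keep_dtype: bool) -> Tensor {
    let data: Vec<f32> = gate.data.iter().zip(&up.data)
        .map(|(&g, &u)| g / (1.0 + (-g).exp()) * u)
        .collect();
    let promoted = Tensor { data, dtype: DType::F32 }; // models the accidental promotion
    if keep_dtype { astype(&promoted, gate.dtype) } else { promoted }
}

/// Softmax evaluated in f32 ("precise"), then cast back to the input dtype.
fn precise_softmax(logits: &Tensor) -> Tensor {
    let max = logits.data.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.data.iter().map(|&v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let probs = Tensor { data: exps.iter().map(|e| e / sum).collect(), dtype: DType::F32 };
    astype(&probs, logits.dtype)
}

fn main() {
    let f32_in = |v: Vec<f32>| Tensor { data: v, dtype: DType::F32 };
    let gate = astype(&f32_in(vec![0.5, -1.0, 2.0]), DType::BF16);
    let up = astype(&f32_in(vec![1.0, 2.0, 3.0]), DType::BF16);
    let next = astype(&f32_in(vec![1.0, 1.0, 1.0]), DType::BF16);

    // Without the cast, the op that consumes the activation inherits F32.
    println!("{:?}", mul(&swiglu(&gate, &up, false), &next).dtype); // F32
    // With the cast back, the downstream op stays in BF16.
    println!("{:?}", mul(&swiglu(&gate, &up, true), &next).dtype); // BF16
    // Router scores are computed precisely but returned in the input dtype.
    println!("{:?}", precise_softmax(&gate).dtype); // BF16
}
```

In this sketch the precise softmax keeps the numerically sensitive reduction in f32 while the returned scores, and therefore everything multiplied by them, stay in the input dtype.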

Impact

On M5 Max, gpt-oss-120b-4bit decode throughput:

Build	tok/s
Before	19.49
After	112.83
mlx-lm baseline	110.35

mlxcel now slightly exceeds the mlx-lm Python baseline on this workload.
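
For reference, the ratios implied by the table can be worked out directly. The snippet below is only arithmetic on the three reported numbers; `speedup` is a hypothetical helper name and not part of the repository.

```rust
/// Ratio of two throughputs measured in tokens per second.
fn speedup(after: f64, before: f64) -> f64 {
    after / before
}

fn main() {
    // Figures from the table above (M5 Max, gpt-oss-120b-4bit decode).
    let (before, after, mlx_lm) = (19.49_f64, 112.83_f64, 110.35_f64);

    // Gain of the fixed build over the previous build.
    println!("after vs before:  {:.2}x", speedup(after, before)); // 5.79x
    // Size of the original regression relative to mlx-lm.
    println!("mlx-lm vs before: {:.2}x", speedup(mlx_lm, before)); // 5.66x
    // Margin of the fixed build over the mlx-lm baseline.
    println!("after vs mlx-lm:  {:+.1}%", (speedup(after, mlx_lm) - 1.0) * 100.0); // +2.2%
}
```

The 5.66x gap to mlx-lm matches the 5–6× regression stated in the summary, and the margin over the baseline is about 2%.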

Files Touched

src/lib/mlxcel-core/cpp/mlx_cxx_bridge.{cpp,h} — new compiled activation helper
src/lib/mlxcel-core/src/lib.rs + ffi_tests.rs — Rust binding + FFI test
src/models/gpt_oss.rs — SwiGLU dtype preservation, router precise softmax + dtype casts

Test plan

cargo test -p mlxcel-core (FFI activation helper)
gpt-oss-120b-4bit decode benchmark on M5 Max — 112.83 tok/s (was 19.49)
Spot-check that other models still load and decode (smoke: qwen3-0.6b, llama3.1-8b)

GptOss SwiGLU now mirrors the mlx-lm activation path with a compiled activation-only helper and casts the result back to the input dtype, preventing FP32 promotion through the expert down projection on single-token decode. The MoE router now uses precise softmax and casts expert scores/results back to the expert/input dtype so residual state remains BF16 across layers. The M5 Max gpt-oss-120b-4bit benchmark improves from 19.49 tok/s to 112.83 tok/s, exceeding the 110.35 tok/s mlx-lm baseline.

inureyes mentioned this pull request May 18, 2026

chore: silence clippy 1.95 lint regressions across workspace #18 (Merged, 3 tasks)

inureyes merged commit 42b7d36 into main May 18, 2026
4 of 5 checks passed

inureyes deleted the fix/gpt-oss-decode-bf16 branch May 18, 2026 11:26

This was referenced May 18, 2026

chore(make): add CI-faithful verify targets to Makefile #19 (Merged)

fix(perf): prevent fp32 promotion in model hot paths #20 (Merged)

inureyes self-assigned this May 18, 2026

inureyes added status:done Completed and removed status:review Under review labels May 18, 2026
